71 research outputs found
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Deep Neural Networks (DNNs) are becoming an important tool in modern
computing applications. Accelerating their training is a major challenge and
techniques range from distributed algorithms to low-level circuit design. In
this survey, we describe the problem from a theoretical perspective, followed
by approaches for its parallelization. We present trends in DNN architectures
and the resulting implications on parallelization strategies. We then review
and model the different types of concurrency in DNNs: from the single operator,
through parallelism in network inference and training, to distributed deep
learning. We discuss asynchronous stochastic optimization, distributed system
architectures, communication schemes, and neural architecture search. Based on
those approaches, we extrapolate potential directions for parallelism in deep
learning.
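To make one of the surveyed concurrency forms concrete, the following is a minimal sketch of synchronous data-parallel SGD: each worker computes a gradient on its own mini-batch shard, and the gradients are averaged (an allreduce in a real distributed system) before a single shared update. The function and variable names are illustrative and not taken from the survey.

```python
import numpy as np

def data_parallel_sgd_step(w, shards, grad_fn, lr=0.1):
    """One synchronous data-parallel SGD step (illustrative sketch).

    Each (x, y) shard stands in for one worker's mini-batch; in a real
    distributed setting the mean below would be an allreduce.
    """
    grads = [grad_fn(w, x, y) for (x, y) in shards]  # per-worker gradients
    g = np.mean(grads, axis=0)                       # average across workers
    return w - lr * g                                # identical update on every worker

# Toy usage: least-squares gradient of a linear model on two shards.
grad_fn = lambda w, x, y: 2 * x.T @ (x @ w - y) / len(y)
rng = np.random.default_rng(0)
w = np.zeros(3)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(2)]
w = data_parallel_sgd_step(w, shards, grad_fn)
```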
Bridging Control-Centric and Data-Centric Optimization
With the rise of specialized hardware and new programming languages, code
optimization has shifted its focus towards promoting data locality. Most
production-grade compilers adopt a control-centric mindset - instruction-driven
optimization augmented with scalar-based dataflow - whereas other approaches
provide domain-specific and general-purpose data movement minimization, which
can miss important control-flow optimizations. As the two representations are
not commutable, users must choose one over the other. In this paper, we explore
how both control- and data-centric approaches can work in tandem via the
Multi-Level Intermediate Representation (MLIR) framework. Through a combination
of an MLIR dialect and specialized passes, we recover parametric, symbolic
dataflow that can be optimized within the DaCe framework. We combine the two
views into a single pipeline, called DCIR, showing that it is strictly more
powerful than either view. On several benchmarks and a real-world application
in C, we show that our proposed pipeline consistently outperforms MLIR and
automatically uncovers new optimization opportunities with no additional
effort.
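For readers unfamiliar with the data-centric side of this pipeline, here is a minimal sketch using the DaCe Python frontend: a program is lifted into an SDFG, the parametric, symbolic dataflow representation on which DaCe's data-movement optimizations operate. The exact calls are assumptions about the public DaCe API and are not the paper's MLIR-based DCIR pipeline itself.

```python
import dace

N = dace.symbol('N')  # symbolic size: the dataflow graph stays parametric in N

@dace.program
def saxpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    y[:] = a * x + y

sdfg = saxpy.to_sdfg()  # lift to the data-centric SDFG representation
sdfg.simplify()         # dataflow simplification (recent DaCe versions)
```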
Big Data causing Big (TLB) Problems: Taming Random Memory Accesses on the GPU
GPUs are increasingly adopted for large-scale database processing, where data accesses represent the major part of the computation. If the data accesses are irregular, like hash table accesses or random sampling, the GPU performance can suffer. Especially when scaling such accesses beyond 2 GB of data, a performance decrease of an order of magnitude is encountered. This paper analyzes the source of the slowdown through extensive micro-benchmarking, attributing the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access. The proposed approach is applied to two fundamental database operations - random sampling and hash-based grouping - showing that the slowdown can be dramatically reduced, and resulting in a performance increase of up to 13×.
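The mitigation described above boils down to keeping each pass of random accesses inside a range the TLB can cover. The sketch below illustrates that idea on the CPU with NumPy; the page size, TLB reach, and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

PAGE_BYTES = 2 * 1024 * 1024            # assumed large-page size
TLB_REACH_BYTES = 512 * PAGE_BYTES      # assumed TLB coverage (hypothetical)

def tlb_conscious_gather(table: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Gather table[idx] in region-grouped order instead of one random sweep.

    Accesses are grouped by the memory region they hit, so each stretch of
    the gather stays within a range the TLB can keep mapped.
    """
    region = max(1, TLB_REACH_BYTES // table.itemsize)  # elements per region
    order = np.argsort(idx // region, kind='stable')    # group accesses by region
    out = np.empty(idx.shape, dtype=table.dtype)
    out[order] = table[idx[order]]                       # region-local accesses
    return out

# Toy usage: random lookups into a large table.
table = np.arange(1 << 20, dtype=np.int64)
idx = np.random.default_rng(1).integers(0, table.size, size=4096)
assert np.array_equal(tlb_conscious_gather(table, idx), table[idx])
```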
Learning Combinatorial Node Labeling Algorithms
We present a graph neural network to learn graph coloring heuristics using
reinforcement learning. Our learned deterministic heuristics give better
solutions than classical degree-based greedy heuristics and only take seconds
to evaluate on graphs with tens of thousands of vertices. As our approach is
based on policy gradients, it also learns a probabilistic policy. These
probabilistic policies outperform all greedy coloring baselines and a machine
learning baseline. Our approach generalizes several previous machine-learning
frameworks that were applied to problems such as minimum vertex cover. We also
demonstrate that our approach outperforms two greedy heuristics on minimum
vertex cover.
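For context, the classical baseline this abstract compares against can be stated in a few lines. The sketch below is the degree-based greedy coloring heuristic (largest degree first), not the learned policy from the paper.

```python
def greedy_color(adj: dict[int, set[int]]) -> dict[int, int]:
    """Degree-based greedy coloring: visit vertices by decreasing degree and
    give each the smallest color not used by an already-colored neighbor."""
    color: dict[int, int] = {}
    for v in sorted(adj, key=lambda u: len(adj[u]), reverse=True):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# Toy usage: a 4-cycle is 2-colorable, and the heuristic finds that here.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
assert max(greedy_color(cycle).values()) + 1 == 2
```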
- …